Abstract. Online forums, or message boards, are rich knowledge-based
communities. In these communities, thread retrieval is an essential tool
for facilitating information access. The central issue in thread search,
however, is how to combine evidence from text units (messages) to estimate
thread relevance. In this paper, we first rank a list of messages, then
score threads by aggregating the scores of their ranked messages. To
aggregate the message scores, we adopt several voting techniques that have
been applied to ranking-aggregates tasks such as blog distillation and
expert finding. The experimental results show that many voting techniques
should be preferred over a baseline that treats a thread as a
concatenation of its message texts.

1 Introduction

Online forums are virtual places (communities) that facilitate seeking and
sharing knowledge through in-depth discussions. A user starts a discussion
by posting an initial message; other users then read the initial message
and answer it with reply messages. The initial message and its replies form
a threaded discussion (thread). One challenge in accessing information in
forums is information overload, and thread retrieval is one way to tackle
it. However, the actual content resides not in the threads but in the
messages. Therefore, given a query, a retrieval system must infer thread
relevance from the message text. In that respect, thread retrieval
resembles ranking-aggregates tasks such as blog feed retrieval [14, 4, 6]
and expert finding [?]. In these tasks, given a query, the objective is to
rank aggregates (blogs, experts) by leveraging their associated text units
(blog postings, experts' writings) [5]. The analogy between ranking
aggregates and thread retrieval is that threads are the aggregates, and
messages are the associated texts or documents.

Voting techniques have performed well in ranking-aggregates tasks
[7, 6, 8]. However, the effectiveness of each voting technique varies
between tasks and datasets [8]. In addition, threads have a conversational
structure that does not exist in other ranking-aggregates contexts: the
meaning of a message is fully understood only within its discussion
context. Furthermore, messages are mostly replies, hence they tend to be
shorter than blog postings and experts' writings. These differences might
alter the performance of voting techniques. In this paper, we review
several voting methods and investigate their performance on thread
retrieval.

2 Voting in Thread Retrieval 

Voting techniques were first proposed by [7] for the expert finding task.
In voting techniques, we first rank a list of documents (e.g., experts'
writings) based on their relevance to the given query. Then, we rank
aggregates (e.g., the experts) based on scores obtained by fusing the
scores or ranks of their ranked documents. Similarly, in this work, given a
query $Q = \{q_1, q_2, \ldots, q_n\}$, we first rank a list of messages
$R_Q$ with respect to $Q$. Then, we score threads by aggregating the scores
or ranks of their ranked messages. Finally, threads are ranked in
descending order of their aggregated scores.

To estimate the relevance between the query $Q$ and a message $M$, we
employ the query language model, assuming term independence, a uniform
prior probability distribution over messages and Dirichlet smoothing, as
follows [?]:

$$P(Q|M) = \prod_{q \in Q} \left( \frac{n(q, M) + \mu P(q|C)}{|M| + \mu} \right)^{n(q, Q)} \qquad (1)$$

where $q$ is a query term, $\mu$ is the smoothing parameter, $n(q, M)$ and
$n(q, Q)$ are the term frequencies of $q$ in $M$ and $Q$ respectively,
$|M|$ is the number of tokens in $M$, and $P(q|C)$ is the collection
language model. Both $P(Q|M)$ and $P(q|C)$ are probability values.
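
The message scoring step can be illustrated with a minimal sketch. It
assumes pre-tokenized text and computes the Dirichlet-smoothed estimate
(n(q, M) + mu * P(q|C)) / (|M| + mu) for each query term, raised to the
power n(q, Q). The function names, the toy data and the decision to skip
query terms that never occur in the collection are assumptions made for
illustration; they are not taken from the paper, whose experiments used
Indri. The sketch returns log P(Q|M) to avoid numerical underflow;
applying math.exp to the result gives P(Q|M).

```python
import math
from collections import Counter


def collection_model(messages):
    """Estimate P(q|C) as the relative term frequency over all messages."""
    counts = Counter(term for message in messages for term in message)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def query_log_likelihood(query, message, p_collection, mu=2000.0):
    """Return log P(Q|M) for a tokenized query and a tokenized message."""
    n_q_m = Counter(message)                     # n(q, M)
    log_p = 0.0
    for q, n_q_q in Counter(query).items():      # n(q, Q)
        p_qc = p_collection.get(q, 0.0)          # P(q|C)
        if p_qc == 0.0:
            continue  # assumption: ignore terms unseen in the collection
        smoothed = (n_q_m[q] + mu * p_qc) / (len(message) + mu)
        log_p += n_q_q * math.log(smoothed)
    return log_p


# Toy data (hypothetical): two short messages and a two-term query.
messages = [["install", "ubuntu", "driver"], ["travel", "new", "york"]]
p_c = collection_model(messages)
scores = [query_log_likelihood(["ubuntu", "driver"], m, p_c, mu=10.0)
          for m in messages]
```

Sorting messages by this score in descending order gives the initial
ranked list of messages that the aggregation methods consume.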

To rank threads, we adapt the twelve aggregation methods proposed by [7]:
Votes, Reciprocal Rank (RR), BordaFuse, CombMIN, CombMAX, CombMED,
CombSUM, CombANZ, CombMNZ, expCombSUM, expCombANZ and expCombMNZ. In
addition to these methods, this study uses CombGNZ, the geometric mean of
the relevance scores. We include this method because it is the aggregation
method employed by [5, 13].

In these methods, the relevance between a thread $T$ and $Q$, $rel(Q, T)$,
is the score obtained by aggregating all of $T$'s ranked messages $R_T$, as
shown below:

$$rel_{Votes}(Q, T) = |R_T| \qquad (2)$$

$$rel_{RR}(Q, T) = \sum_{M \in R_T} \frac{1}{rank(Q, M)} \qquad (3)$$

$$rel_{BordaFuse}(Q, T) = \sum_{M \in R_T} \left( |R_Q| - rank(Q, M) \right) \qquad (4)$$

$$rel_{CombMIN}(Q, T) = \min_{M \in R_T} P(Q|M) \qquad (5)$$

$$rel_{CombMAX}(Q, T) = \max_{M \in R_T} P(Q|M) \qquad (6)$$

$$rel_{CombMED}(Q, T) = \operatorname{median}_{M \in R_T} P(Q|M) \qquad (7)$$

$$rel_{CombSUM}(Q, T) = \sum_{M \in R_T} P(Q|M) \qquad (8)$$

$$rel_{CombANZ}(Q, T) = \frac{1}{|R_T|} \times \sum_{M \in R_T} P(Q|M) \qquad (9)$$

$$rel_{CombGNZ}(Q, T) = \left( \prod_{M \in R_T} P(Q|M) \right)^{1/|R_T|} \qquad (10)$$

$$rel_{CombMNZ}(Q, T) = |R_T| \times \sum_{M \in R_T} P(Q|M) \qquad (11)$$

$$rel_{expCombSUM}(Q, T) = \sum_{M \in R_T} \exp(P(Q|M)) \qquad (12)$$

$$rel_{expCombANZ}(Q, T) = \frac{1}{|R_T|} \times \sum_{M \in R_T} \exp(P(Q|M)) \qquad (13)$$

$$rel_{expCombMNZ}(Q, T) = |R_T| \times \sum_{M \in R_T} \exp(P(Q|M)) \qquad (14)$$

where $rank(Q, M)$ is the rank of the message $M$ in $R_Q$, $|R_Q|$ is the
size of $R_Q$, and $|R_T|$ is the number of $T$'s ranked messages.

As an illustrative example, let $R_Q = \{M_1, M_2, M_3, M_4, M_5, M_6\}$
denote a list of ranked messages associated with three threads $T_1$, $T_2$
and $T_3$, where $M_1$ belongs to $T_1$; $M_2$ and $M_3$ belong to $T_2$;
and $M_4$, $M_5$ and $M_6$ belong to $T_3$. In addition, let the relevance
scores that the query language model assigns to these messages be 0.06,
0.05, 0.04, 0.03, 0.02 and 0.01 respectively, so that their ranks are 1, 2,
3, 4, 5 and 6. Then, the relevance between the given query $Q$ and the
thread $T_3$ under the Votes, CombSUM and BordaFuse aggregation methods is
calculated as follows:

$$rel_{Votes}(Q, T_3) = |R_{T_3}| = 3$$

$$rel_{CombSUM}(Q, T_3) = P(Q|M_4) + P(Q|M_5) + P(Q|M_6) = 0.03 + 0.02 + 0.01 = 0.06$$

$$rel_{BordaFuse}(Q, T_3) = (6 - rank(Q, M_4)) + (6 - rank(Q, M_5)) + (6 - rank(Q, M_6)) = 2 + 1 + 0 = 3$$
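
The thirteen aggregation methods can be sketched in a few lines. The
sketch below is an illustration rather than the implementation used in the
experiments: the function name aggregate, the input layout (a list of
(message, score) pairs sorted by descending score, plus a message-to-thread
mapping) and the toy identifiers are assumptions. CombGNZ is computed
through logarithms and therefore assumes strictly positive scores. The toy
data follow the illustrative setup above, with six ranked messages spread
over three threads.

```python
import math
from collections import defaultdict
from statistics import median


def aggregate(ranked, thread_of):
    """Score threads from a ranked message list R_Q.

    ranked:    list of (message_id, score), sorted by descending score
    thread_of: dict mapping message_id -> thread_id
    Returns {method_name: {thread_id: rel(Q, T)}}.
    """
    size = len(ranked)                        # |R_Q|
    scores = defaultdict(list)                # thread -> P(Q|M), M in R_T
    ranks = defaultdict(list)                 # thread -> rank(Q, M), M in R_T
    for rank, (msg, score) in enumerate(ranked, start=1):
        scores[thread_of[msg]].append(score)
        ranks[thread_of[msg]].append(rank)

    # Each entry takes a thread's scores s and ranks r.
    methods = {
        "Votes": lambda s, r: len(s),
        "RR": lambda s, r: sum(1.0 / x for x in r),
        "BordaFuse": lambda s, r: sum(size - x for x in r),
        "CombMIN": lambda s, r: min(s),
        "CombMAX": lambda s, r: max(s),
        "CombMED": lambda s, r: median(s),
        "CombSUM": lambda s, r: sum(s),
        "CombANZ": lambda s, r: sum(s) / len(s),
        "CombGNZ": lambda s, r: math.exp(sum(math.log(x) for x in s) / len(s)),
        "CombMNZ": lambda s, r: len(s) * sum(s),
        "expCombSUM": lambda s, r: sum(math.exp(x) for x in s),
        "expCombANZ": lambda s, r: sum(math.exp(x) for x in s) / len(s),
        "expCombMNZ": lambda s, r: len(s) * sum(math.exp(x) for x in s),
    }
    return {name: {t: f(scores[t], ranks[t]) for t in scores}
            for name, f in methods.items()}


# Toy data (hypothetical identifiers): six messages in three threads.
ranked = [("M1", 0.06), ("M2", 0.05), ("M3", 0.04),
          ("M4", 0.03), ("M5", 0.02), ("M6", 0.01)]
thread_of = {"M1": "T1", "M2": "T2", "M3": "T2",
             "M4": "T3", "M5": "T3", "M6": "T3"}
rel = aggregate(ranked, thread_of)
```

Sorting the threads of any one method by descending value, for example
sorted(rel["CombSUM"], key=rel["CombSUM"].get, reverse=True), gives that
method's thread ranking.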



3 Related Studies 

The voting approach to ranking-aggregates tasks is inspired by work on data
fusion (meta search) [15, 11, 1]. A meta search algorithm aims to combine
several ranked lists of documents, generated by various retrieval methods,
into a unified list. The rationale of data fusion is twofold [11]. First,
the more retrieval methods retrieve a particular document, the more that
document is expected to be relevant to the user query. Second, a document
that is ranked at top positions by many retrieval methods is likely to be
more relevant than one found at the bottom of several ranked lists. Data
fusion methods can be categorized into score-based and rank-based
aggregation methods. The score-based methods, such as [15]'s CombMAX,
CombMIN, CombMED and CombSUM, use the relevance scores of documents,
whereas the rank-based methods [?] utilize the ranking positions of these
documents in the ranked lists.

[7, 6] approached the problem of ranking aggregates as a data fusion
problem: each document is a piece of evidence about the relevance of its
parent aggregate to the query. Generally, the voting approach was found to
be statistically superior to baseline methods [7, 6]. However, the
performance of each voting technique was not consistent across tasks [8]:
the CombMAX method, which performed well in the expert finding setting,
was significantly worse than the baseline methods in the blog distillation
setting. Therefore, how these methods perform on the thread retrieval task
is the focus of this study.

Several combination techniques have been proposed to address evidence
combination for thread retrieval. [5] proposed two strategies to rank
threads: inclusive and selective. The inclusive strategy utilizes evidence
from all messages in order to rank parent threads. Two models from previous
work on blog site retrieval [3] were adapted to thread search: the large
document and the small document models. The large document model creates a
virtual document for each thread by concatenating the thread's message
texts, then scores threads based on the relevance of their virtual
documents to the query. In contrast, the small document model defines a
thread as a collection of text units (messages) and scores threads by
adding up their messages' relevance scores. In contrast to the inclusive
strategy, [5]'s selective strategy treats threads as collections of
messages and uses only a few messages to rank threads. Three selective
methods were used. The first scores threads using only the relevance score
of the initial message. The second scores threads by taking the maximum of
their messages' relevance scores. The third is based on the Pseudo Cluster
Selection (PCS) method [14]. PCS scores threads in two steps: it scores a
list of messages, then ranks threads by taking the geometric mean of the
scores of the top k ranked messages from each thread. Generally, it was
found that the selective models are statistically superior to the inclusive
models [5, 3]. Our work extends this selective strategy by investigating
more aggregation methods. In addition, PCS focuses on the top k ranked
messages, whereas we focus on all ranked messages. Applying voting
techniques as aggregation methods in PCS is an interesting problem, but we
leave it for a separate study.
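
To make the contrast with PCS concrete, the following is a minimal sketch
of its second step as described above: the geometric mean of the top k
ranked message scores of each thread. The function name pcs_scores, the
default k and the toy data are assumptions. In particular, how threads with
fewer than k ranked messages are treated is an assumption of this sketch
(it averages over the messages that are available); the published method
may handle that case differently.

```python
import math
from collections import defaultdict


def pcs_scores(ranked, thread_of, k=5):
    """Geometric mean of each thread's top-k message scores.

    ranked:    list of (message_id, score) with strictly positive scores
    thread_of: dict mapping message_id -> thread_id
    """
    by_thread = defaultdict(list)
    for msg, score in ranked:
        by_thread[thread_of[msg]].append(score)

    result = {}
    for thread, scores in by_thread.items():
        top = sorted(scores, reverse=True)[:k]
        # Assumption: use fewer than k scores when a thread has fewer.
        result[thread] = math.exp(sum(math.log(s) for s in top) / len(top))
    return result


# Toy data (hypothetical identifiers).
ranked = [("M1", 0.06), ("M2", 0.05), ("M3", 0.04),
          ("M4", 0.03), ("M5", 0.02), ("M6", 0.01)]
thread_of = {"M1": "T1", "M2": "T2", "M3": "T2",
             "M4": "T3", "M5": "T3", "M6": "T3"}
top2 = pcs_scores(ranked, thread_of, k=2)
```

With k at least as large as the longest thread, this sketch coincides with
the geometric mean taken over all ranked messages of a thread.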

Another line of research is the multiple context retrieval approach
proposed by [13]. This approach treats a thread as a collection of several
local contexts, which are types of self-contained text units. Four contexts
were proposed: posts (identical to messages), pairs, dialogues and the
entire thread. The thread and post contexts are identical to [5]'s virtual
document and message-based representations. In the pair and dialogue
contexts, the conversational relationship between messages is exploited to
build text units. In the pair context, for each pair of messages
$m_i, m_j$ that have a reply relationship ($m_j$ is a reply to $m_i$), a
text unit is built by concatenating their texts. In the dialogue context,
for each chain of replies that starts with the initial message and in which
there is a reply relation between each message and its neighbour in the
chain, a text unit is built by concatenating the chain's message texts. To
rank threads using the post, pair and dialogue contexts, PCS was used. It
was observed that retrieval using the dialogue context outperformed
retrieval using the other contexts. Additionally, the weighted product of
the thread context and the dialogue context achieved the best performance.
In our work, we focus on how to combine the relevance scores of the ranked
contexts. Therefore, our work is complementary to [13]'s work.

The third line of work is the structure-based document retrieval proposed
by [2]. In this approach, a thread consists of a collection of structural
components: the title, the initial message and the set of reply messages.
In this representation, the relevance of the thread to the user query is
estimated using [10]'s inference network framework. Our work can be applied
to [2]'s representation as well: we could use [2]'s inference-based
relevance score in the same way that the thread context score was used in
[13].

4 Experimental Design 

Thread retrieval is a new task, and the number of test collections is
limited. In this study, we used the same corpus used by [2]. It has two
datasets from two forums: the Ubuntu¹ and Travel² forums. The statistics of
the corpus are given in Table 1. Text was stemmed with the Porter stemmer,
and stopword removal was applied at the ranking stage. We conducted the
experiments with the Indri retrieval system³.

¹ ubuntuforums.org
² http://www.tripadvisor.com/ShowForum-g28953-i4-New_York.html
³ http://www.lemurproject.org/indri.php

Table 1. Statistics of the test collection

                         Ubuntu    Travel
No. of threads           113277     83072
No. of users             103280     39454
No. of messages          676777    590021
No. of queries               25        25
No. of judged threads      4512      4478

As for evaluation, we use [5]'s virtual document model (VD) as a baseline.
This model has been used as a strong baseline in previous studies
[5, 13, 2]. For each query, we calculated the measures commonly used in ad
hoc retrieval [3]: Precision at 10 (P@10), Normalized Discounted Cumulative
Gain at 10 (NDCG@10), Mean Reciprocal Rank (MRR) and Mean Average Precision
(MAP). In all experiments, we followed the same relevance protocol as
[2, 13]: a thread is considered relevant if its relevance level is greater
than or equal to 1, i.e., if it is partially or highly relevant, and
irrelevant if its relevance level is zero.
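
The binary measures and the relevance threshold can be sketched as follows.
This is an illustrative sketch, not the evaluation code used in the
experiments; the function names and the toy judgments are assumptions.
NDCG@10 is omitted because the text does not specify the gain and discount
functions. Averaging reciprocal_rank and average_precision over all queries
gives MRR and MAP.

```python
def binarize(judgments, threshold=1):
    """Threads whose relevance level is >= threshold count as relevant."""
    return {t for t, level in judgments.items() if level >= threshold}


def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top k ranked threads that are relevant."""
    return sum(1 for t in ranking[:k] if t in relevant) / k


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant thread, or 0 if none is retrieved."""
    for i, t in enumerate(ranking, start=1):
        if t in relevant:
            return 1.0 / i
    return 0.0


def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks of relevant threads."""
    hits, total = 0, 0.0
    for i, t in enumerate(ranking, start=1):
        if t in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0


# Toy data (hypothetical): graded judgments and one system ranking.
judgments = {"T1": 2, "T2": 0, "T3": 1, "T4": 1}
relevant = binarize(judgments)
ranking = ["T2", "T1", "T3", "T5"]
```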

As for parameter estimation, we estimated the smoothing parameter $\mu$ for
the virtual document and message language models. In addition, for all
voting techniques, we estimated the size of the initial ranked list of
messages $R_Q$. To estimate $\mu$, we varied its value from 500 to 4000 in
steps of 500. To estimate the size of $R_Q$, we varied its value from 500
to 5000 in steps of 500. Then, an exhaustive grid search was applied to
maximize MAP using 5-fold cross validation.
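
One way to organize such a search is sketched below. It is a sketch under
stated assumptions: ap_for is a hypothetical callback that returns the
average precision of one query for a given (mu, size) setting, queries are
assigned to folds round-robin, and the best setting of each training split
is evaluated on the held-out fold. The paper does not describe these
details, so they should not be read as the authors' exact procedure.

```python
from itertools import product
from statistics import mean

MU_GRID = range(500, 4001, 500)      # smoothing parameter values
SIZE_GRID = range(500, 5001, 500)    # sizes of the initial ranked list


def cross_validated_map(queries, ap_for, n_folds=5):
    """Grid-search (mu, size) on training folds, score on held-out folds.

    ap_for(query, mu, size) -> average precision (hypothetical callback).
    Returns (held-out MAP, list of the setting chosen for each fold).
    """
    folds = [queries[i::n_folds] for i in range(n_folds)]  # round-robin
    held_out, chosen = [], []
    for i, test_fold in enumerate(folds):
        train = [q for j, fold in enumerate(folds) if j != i for q in fold]
        # Exhaustive search: keep the setting with the best training MAP.
        best = max(product(MU_GRID, SIZE_GRID),
                   key=lambda p: mean(ap_for(q, *p) for q in train))
        chosen.append(best)
        held_out.extend(ap_for(q, *best) for q in test_fold)
    return mean(held_out), chosen
```

Selecting the setting on the training folds only keeps the reported MAP
from being tuned on the queries it is measured on.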

5 Results and Discussion



Table 2. Retrieval performance of the voting methods on the Ubuntu dataset

Method        MAP        MRR        P@10       NDCG@10
VD            0.3437     0.7258     0.4200     0.3284
CombGNZ       0.2272 v   0.4974 T   0.2760 v   0.1971 v
Votes         0.2749 v   0.6550     0.4680     0.3551
RR            0.3313     0.6287     0.4600 A   0.3428
BordaFuse     0.3153     0.6913     0.5080     0.3778
CombMIN       0.1779 v   0.5000 T   0.2600 v   0.1849 v
CombMAX       0.3074 v   0.6420     0.4480     0.3257
CombMED       0.2212 v   0.5021 T   0.2760 v   0.1927 v
CombSUM       0.3100     0.6667     0.4720     0.3633
CombANZ       0.2314 v   0.4971 T   0.2800 v   0.1991 v
CombMNZ       0.3108     0.6933     0.4880     0.3720
expCombSUM    0.3088     0.6933     0.4840     0.3676
expCombANZ    0.2315 v   0.4971 T   0.2800 v   0.1991 v
expCombMNZ    0.3088     0.6933     0.4840     0.3676

The symbols * and A denote statistically significant improvements over the
virtual document model (VD) at p-value < 0.01 and 0.05 respectively, using
a paired randomization test. Similarly, v and T denote statistically
significant degradations relative to VD at p-value < 0.01 and 0.05
respectively.
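
A paired randomization test of the kind named in the table note can be
sketched as follows. The inputs are the per-query scores of two systems
(for example, the average precision of a voting method and of VD on the
same queries). The function name, the number of trials, the fixed seed and
the two-sided formulation are assumptions of this sketch; the paper does
not state these details.

```python
import random


def paired_randomization_test(a, b, trials=10000, seed=0):
    """Approximate two-sided p-value for the mean difference of paired scores.

    a, b: per-query scores of two systems, aligned by query.
    Under the null hypothesis the two labels are exchangeable, so the sign
    of each per-query difference is flipped at random.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(trials):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total / len(diffs)) >= observed:
            extreme += 1
    return extreme / trials
```

A returned value below 0.01 or 0.05 would correspond to the two
significance levels reported in the table note.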

Table 3. Retrieval performance of the voting methods on the Travel dataset

Method        MAP        MRR        P@10       NDCG@10
VD            0.3774     0.6967     0.4800     0.3549
CombGNZ       0.2001 v   0.4838 T   0.3320 v   0.2319 v
Votes         0.3066 v   0.7491     0.5080     0.4063
RR            0.3155 v   0.6120     0.4520     0.3431
BordaFuse     0.3630     0.7547     0.5640     0.4350
CombMIN       0.1574 v   0.4843 T   0.3040 v   0.2199 v
CombMAX       0.2724 v   0.5754     0.4360     0.3216
CombMED       0.2004 v   0.4841 T   0.3480 v   0.2388 v
CombSUM       0.3668     0.8000     0.5560     0.4440 *
CombANZ       0.2065 v   0.4841 T   0.3400 v   0.2346 v
CombMNZ       0.3575     0.7790     0.5280     0.4205
expCombSUM    0.3513     0.7937     0.5200     0.4109
expCombANZ    0.2065 v   0.4841 T   0.3400 v   0.2346 v
expCombMNZ    0.3513     0.7937     0.5200     0.4109

The symbols * and A denote statistically significant improvements over the
virtual document model (VD) at p-value < 0.01 and 0.05 respectively, using
a paired randomization test. Similarly, v and T denote statistically
significant degradations relative to VD at p-value < 0.01 and 0.05
respectively.

Table 2 and Table 3 present the retrieval performance of the voting methods
on thread retrieval for the Ubuntu dataset and the Travel dataset
respectively. Several observations can be drawn from these tables. The
first concerns the performance of the aggregation methods compared with the
baseline method, the virtual document (VD) model. On the high-precision
measures (P@10 and NDCG@10), RR, BordaFuse, CombSUM, CombMNZ, expCombSUM
and expCombMNZ produce better or comparable results with respect to VD.
These methods favour threads with highly ranked messages. In contrast,
CombGNZ, CombMED, CombANZ, CombMIN and expCombANZ might be affected by
threads that have many low-scored messages. This behaviour was also
reported when voting techniques were applied to expert finding [6].
Therefore, in line with [7]'s conclusion, we assert that highly ranked
messages are good indicators of relevant threads.

To confirm this conclusion, we studied the effect of varying the size of
the initial ranked list. As Figure 1 and Figure 2 show, retrieval
performance decreases as the size becomes relatively large (more than
1000). In addition, almost all methods suffer from this problem except RR
and CombMAX. This is expected because RR and CombMAX inherently address the
problem of low-scored messages. RR adds up the inverses of the messages'
ranks, thus it limits the influence of threads with many low-ranked
messages. CombMAX takes only the best-scoring message of each thread;
therefore, if no new threads are introduced as the size increases, the
order of threads will not change. That explains the convergence of CombMAX
and RR and the consistent decline of the other methods. The same pattern
was observed with other measures such as P@10 and NDCG@10 (not shown in
this paper). This indicates the importance of highly ranked messages to
thread retrieval.

Fig. 1. The performance of aggregation methods as the size of the initial
ranked list increases on the Ubuntu dataset.

Fig. 2. The performance of aggregation methods as the size of the initial
ranked list increases on the Travel dataset.

Another observation is the importance of utilizing non-score signals. For
instance, the performance of the Votes method is relatively good compared
with other methods. Similarly, CombMNZ, which makes use of the number of
ranked messages in addition to the sum of scores, performs comparably well.
Both methods leverage information that does not come from scores: the
number of ranked messages. Nevertheless, excessive emphasis on these
signals hurts performance, as can be seen from the fast decline of the
Votes and CombMNZ methods as the size increases. One possible reason is
that adding up low scores has less impact than multiplying by the number of
these messages; CombSUM's decline is always smaller than those of the Votes
and CombMNZ methods.

Although the improvements of the voting methods are not statistically
significant, they are consistent on both datasets and require only the
message index. That gives the voting approach an extra advantage over the
virtual document model: the message index coincides with what users
contribute, hence the retrieval system does not need to re-concatenate
messages into a virtual document whenever a new message is created or
edited.

6 Conclusion 

In this paper, we studied the application of voting techniques to online
forum thread retrieval. We used thirteen voting methods that aggregate the
scores or ranks of ranked messages in order to score their parent threads.
The experimental results show that the voting techniques that favour
threads with highly ranked messages (RR, BordaFuse, CombSUM, CombMNZ,
expCombSUM and expCombMNZ) produced comparable or better performance
compared with the baseline, and that no single method is a clear winner.
Although the observed improvements were not statistically significant, we
recommend using the voting methods because their improvements are
consistent across datasets, and they coincide with what users contribute.

Nevertheless, these findings have motivated us to further study the effects
of voting techniques when aggregating only the top k messages. Another
future direction is incorporating these voting methods into [13]'s multiple
context models. A similar approach will be applied to incorporate the
structural component representation of [2].